An Empirical Comparison of Goodness Measures for Unsupervised Chinese Word Segmentation with a Unified Framework

نویسندگان

  • Hai Zhao
  • Chunyu Kit
چکیده

This paper reports our empirical evaluation and comparison of several popular goodness measures for unsupervised segmentation of Chinese texts using Bakeoff-3 data sets with a unified framework. Assuming no prior knowledge about Chinese, this framework relies on a goodness measure to identify word candidates from unlabeled texts and then applies a generalized decoding algorithm to find the optimal segmentation of a sentence into such candidates with the greatest sum of goodness scores. Experiments show that description length gain outperforms other measures because of its strength for identifying short words. Further performance improvement is also reported, achieved by proper candidate pruning and by assemble segmentation to integrate the strengths of individual measures.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Chinese Word Segmentation with Description Length Gain

Supervised and unsupervised learning has seldom joined with and thus lend strength to each other in the field of Chinese word segmentation (CWS). This paper presents a novel approach to CWS that utilizes description length gain (DLG), an empirical goodness measure for unsupervised word discovery, to enhance the segmentation performance of conditional random field (CRF) learning. Specifically, w...

متن کامل

A New Unsupervised Approach to Word Segmentation

This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and uncertainty of character sequence co-occurrence in corpora are considered as statistical evidence supporting goodness measurement. Additionally, the statistical data of character sequence...

متن کامل

Incorporating Global Information into Supervised Learning for Chinese Word Segmentation

This paper presents a novel approach to Chinese word segmentation (CWS) that attempts to utilize global information (GI) such as co-occurrence of sub-sequences and outputs of unsupervised segmentation in the whole text for further enhancement of the state-of-the-art performance of conditional random fields (CRF) learning. In the existing work of CWS, supervised and unsupervised learning seldom ...

متن کامل

A Simple and Effective Unsupervised Word Segmentation Approach

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary...

متن کامل

New Word Detection for Sentiment Analysis

Automatic extraction of new words is an indispensable precursor to many NLP tasks such as Chinese word segmentation, named entity extraction, and sentiment analysis. This paper aims at extracting new sentiment words from large-scale user-generated content. We propose a fully unsupervised, purely data-driven framework for this purpose. We design statistical measures respectively to quantify the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008